Frustratingly Short Attention Spans in Neural Language Modeling
Authors
Abstract
Neural language models predict the next token using a latent representation of the immediate token history. Recently, various methods for augmenting neural language models with an attention mechanism over a differentiable memory have been proposed. For predicting the next token, these models query information from a memory of the recent history, which can facilitate learning mid- and long-range dependencies. However, conventional attention mechanisms used in memory-augmented neural language models produce a single output vector per time step. This vector is used both for predicting the next token and as the key and value of a differentiable memory of the token history. In this paper, we propose a neural language model with a key-value attention mechanism that outputs separate representations for the key and value of a differentiable memory, as well as for encoding the next-word distribution. This model outperforms existing memory-augmented neural language models on two corpora. Yet, we found that our method mainly utilizes a memory of the five most recent output representations. This led to the unexpected main finding that a much simpler model based only on the concatenation of recent output representations from previous time steps is on par with more sophisticated memory-augmented neural language models.
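To make the key-value idea concrete, the following is a minimal sketch (our illustration, not the authors' code) of one attention step in which each output vector is split into a key that queries the memory, a value that builds the context, and a predict part fed to the next-word softmax. The class name, layer sizes, and scoring function are illustrative assumptions; the five-step window echoes the abstract's finding about recent history.

# Hedged sketch of key-value-predict attention over the last few time steps.
# Names, sizes, and the additive scoring function are assumptions, not the
# paper's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyValuePredictAttention(nn.Module):
    def __init__(self, hidden_size, window=5):
        super().__init__()
        self.window = window          # attend only over the last few steps
        self.split = hidden_size      # size of each key/value/predict part
        self.score = nn.Linear(hidden_size, 1, bias=False)
        self.combine = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, outputs):
        # outputs: (batch, time, 3 * hidden) from an RNN whose output vector
        # is split into key / value / predict parts.
        keys, values, predicts = torch.split(outputs, self.split, dim=-1)
        batch, time, _ = keys.shape
        states = []
        for t in range(time):
            lo = max(0, t - self.window)
            if t == 0:
                h = predicts[:, t]            # no history yet
            else:
                k, v = keys[:, lo:t], values[:, lo:t]
                # score the current key against the keys of recent steps
                scores = self.score(torch.tanh(k + keys[:, t:t + 1])).squeeze(-1)
                attn = F.softmax(scores, dim=-1)
                context = torch.einsum("bt,btd->bd", attn, v)
                # merge the attended context with the predict part
                h = torch.tanh(self.combine(torch.cat([context, predicts[:, t]], -1)))
            states.append(h)
        return torch.stack(states, dim=1)     # fed to the output softmax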
Similar resources
Distraction-Based Neural Networks for Modeling Documents
Distributed representations learned with neural networks have recently been shown to be effective in modeling natural languages at fine granularities such as words, phrases, and even sentences. Whether and how such an approach can be extended to help model larger spans of text, e.g., documents, is intriguing, and further investigation would still be desirable. This paper aims to enhance neural network...
Attention-based Memory Selection Recurrent Network for Language Modeling
Recurrent neural networks (RNNs) have achieved great success in language modeling. However, since RNNs have a fixed-size memory, they cannot store all the information about the words seen so far in the sentence, and thus useful long-term information may be ignored when predicting the next words. In this paper, we propose Attention-based Memory Selection Recurrent Network...
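As a rough, hedged sketch of what such memory selection could look like (not the cited authors' implementation; all names and sizes below are assumptions), an RNN can keep the hidden states of the words seen so far and attend over that memory at every step when predicting the next word:

# Hedged sketch: language model that re-selects information from a memory of
# its own past hidden states via attention. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemorySelectionLM(nn.Module):
    def __init__(self, vocab_size, emb_size=128, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.rnn = nn.GRU(emb_size, hidden_size, batch_first=True)
        self.query = nn.Linear(hidden_size, hidden_size, bias=False)
        self.out = nn.Linear(2 * hidden_size, vocab_size)

    def forward(self, tokens):
        states, _ = self.rnn(self.embed(tokens))        # (batch, time, hidden)
        logits = []
        for t in range(states.size(1)):
            h = states[:, t]                            # current state
            memory = states[:, : t + 1]                 # states seen so far
            scores = torch.einsum("bd,btd->bt", self.query(h), memory)
            attn = F.softmax(scores, dim=-1)            # select from memory
            context = torch.einsum("bt,btd->bd", attn, memory)
            logits.append(self.out(torch.cat([h, context], dim=-1)))
        return torch.stack(logits, dim=1)               # next-word logits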
Distraction-Based Neural Networks for Document Summarization
Distributed representations learned with neural networks have recently been shown to be effective in modeling natural languages at fine granularities such as words, phrases, and even sentences. Whether and how such an approach can be extended to help model larger spans of text, e.g., documents, is intriguing, and further investigation would still be desirable. This paper aims to enhance neural network...
Short Term Load Forecasting by Using ESN Neural Network Hamedan Province Case Study
Forecasting electrical energy demand and consumption is one of the important decision-making tools for distribution companies in contract scheduling and purchasing of electrical energy. This paper studies load-consumption modeling in the Hamedan city and province distribution network by applying an ESN neural network. Weather forecasting data such as minimum daily temperature, average daily temp...
Autoregressive Attention for Parallel Sequence Modeling
We introduce an autoregressive attention mechanism for parallelizable character-level sequence modeling. We use this method to augment a neural model consisting of blocks of causal convolutional layers connected by highway network skip connections. We denote the models with and without the proposed attention mechanism respectively as Highway Causal Convolution (Causal Conv) and Autoregressive-at...
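The following is a minimal, assumed sketch (not the cited paper's code) of one such building block: a causal 1-D convolution whose output is merged with its input through a highway-style gate, so the model stays parallelizable over time while position t never sees future positions:

# Hedged sketch of a causal convolution block with a highway skip connection.
# Kernel size and gating choice are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvHighwayBlock(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1                      # pad only on the left
        self.conv = nn.Conv1d(channels, channels, kernel_size)
        self.gate = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):
        # x: (batch, channels, time); left padding keeps the convolution
        # causal, so position t never sees positions after t.
        padded = F.pad(x, (self.pad, 0))
        h = torch.tanh(self.conv(padded))               # candidate activation
        g = torch.sigmoid(self.gate(padded))            # highway transform gate
        return g * h + (1.0 - g) * x                    # highway skip connection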
Journal: CoRR
Volume: abs/1702.04521
Year of publication: 2017